Technical editor: Wang Ziyi
Stata & Python online courses are here!
A good rain knows its season; these courses arrive just as spring stirs. To thank everyone for your long-standing support and trust, 爬虫俱乐部 is sending out a treat: discounted Stata and Python courses are both now live on Tencent Classroom (腾讯课堂). The Python programming course, originally priced at 2,400 yuan, now costs only 100 yuan; see the earlier post for details. On the Stata side, the club has launched a course series covering string functions, regular expressions, web crawling, and text analysis, which you can mix and match freely at a price well worth it; see the Stata series posts for more. What changes is the price; what does not change is the course quality and the Q&A service. If you have any questions about the courses, feel free to leave a message in the official-account backend or on Tencent Classroom!
A fine drizzle drifts down as the willow buds open; mist and cloud coil round like a terrace of immortals.
A river of spring water flows clear and unhurried; ten li of peach blossoms are tailored like brocade.
Li Ziqi, a woman who writes her life into poetry and lives in a modern-day Peach Blossom Spring, has let a dormant pastoral dream settle into reality. The simple human happiness she embodies, in the spirit of "snow-foam and milky blossoms float in the midday cup; wild greens and bamboo shoots taste of the spring platter," is full of hearth-smoke warmth and the sweetness unique to country life. This easeful way of living is like a clear spring amid the bustling city, flowing into the heart of every fan.
Today, we will start from the data and look together at the regional distribution of Li Ziqi's Weibo fans. Start~
Crawling approach
Scraping the fan user IDs
First, open Li Ziqi's Weibo fan page through its URL:
Then inspect the captured requests behind the fan list in the browser's developer tools:
The URLs corresponding to two different pages of the fan list:
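Based on the start_url used in the script below, consecutive page requests differ only in the since_id parameter and look roughly like this:

https://m.weibo.cn/api/container/getIndex?containerid=231051_-_fans_-_2970452952&since_id=1
https://m.weibo.cn/api/container/getIndex?containerid=231051_-_fans_-_2970452952&since_id=2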
Comparing the two URLs shows that the fan list is paginated through the since_id parameter in the URL. By stepping since_id through its range (1-250, at 20 fans per page), we can therefore collect up to 5,000 fan user IDs.
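For reference, the JSON this endpoint returns has roughly the shape sketched below; this is reconstructed only from the fields the script reads, and the values are placeholders:

# Sketch of r.json() for one fan-list page (only the keys the script uses)
response = {
    "data": {
        "cardlistInfo": {"since_id": 22},        # cursor for the next page
        "cards": [
            {"card_group": [
                {"user": {"id": 1234567890}},    # one entry per fan
                {"user": {"id": 2345678901}},
            ]},
        ],
    },
}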
# Scrape the fan user IDs
## Import the required libraries
import re
import time
import random
import requests
from tqdm import tqdm_notebook  # renders a progress bar in Jupyter

def get_userid(url):
    # Rotate through a small pool of user agents to look less like a bot
    header_list = [
        "Opera/12.0(Windows NT 5.2;U;en)Presto/22.9.168 Version/12.00",
        "Opera/12.0(Windows NT 5.1;U;en)Presto/22.9.168 Version/12.00",
        "Mozilla/5.0 (Windows NT 5.1) Gecko/20100101 Firefox/14.0 Opera/12.0",
        "Opera/9.80 (Windows NT 6.1; WOW64; U; pt) Presto/2.10.229 Version/11.62",
        "Opera/9.80 (Windows NT 6.0; U; pl) Presto/2.10.229 Version/11.62",
    ]
    header = {
        'user-agent': random.choice(header_list)
    }
    pat = 'since_id=(.*)'
    with open('D:/python爬虫/李子柒微博粉丝地区分布/user_id.txt', 'w') as f:
        for page in tqdm_notebook(range(1, 251), desc='Progress:'):
            try:
                print(url)  # show which page is being fetched
                r = requests.get(url, headers=header)
                all_user = r.json()['data']['cards'][0]['card_group']
                since_id = r.json()['data']['cardlistInfo']['since_id']
                for user in all_user:
                    f.write(str(user.get('user')['id']) + '\n')
                # Swap in the next page's since_id returned by the API
                url = re.sub(pat, 'since_id=' + str(since_id), url)
                time.sleep(random.randint(1, 2))  # pause to avoid rate limiting
            except Exception as e:
                print(e)

if __name__ == '__main__':
    start_url = "https://m.weibo.cn/api/container/getIndex?containerid=231051_-_fans_-_2970452952&since_id=21"
    get_userid(start_url)
The output is as follows:
When the progress bar reaches 100%, all of the user IDs have been scraped~
Next, we use the fan user IDs scraped above to fetch each fan's public profile information.
First, import the required libraries.
# Fetch each fan's public profile from the scraped user IDs
import requests
from lxml import etree
import pandas as pd
import numpy as np
import re
import time
import random
import os
os.chdir(r"D:\python爬虫\李子柒微博粉丝地区分布")  # raw string keeps the backslashes literal
Next, log in to the legacy Weibo web interface (weibo.cn), open Li Ziqi's page, and copy the request headers, including your own cookie, from the browser's developer tools.
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "cookie": "paste your own cookie here",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36"
}
Then, scrape the fans' public profile information.
new_url = "https://weibo.cn/u/"
data = []
count = 0

def get_id(ID):
    # Read the IDs scraped earlier, stripping the trailing newlines
    with open(ID, 'r') as f:
        user_list = f.readlines()
    user_id = np.char.rstrip(user_list, '\n')
    return user_id

def gethtml(url, header):
    r = requests.get(url, headers=header)
    if r.status_code == 200:
        return r.text
    else:
        print("Network error")

for user_id in get_id('user_id.txt'):
    try:
        url = new_url + user_id
        r_text = gethtml(url, headers)
        tree = etree.HTML(r_text.encode('utf-8'))
        # Screen name and the profile line (gender/region) beneath it
        user_name_xpath = "//tr/td[2]/div/span[1]/text()[1]"
        user_name = tree.xpath(user_name_xpath)
        Inf_xpath = "//tr/td[2]/div/span[1]/text()[2]"
        Inf = tree.xpath(Inf_xpath)
        # Follow count and fan count
        focusnumber_xpath = "//div[4]/div/a[1]/text()"
        focusnumber = tree.xpath(focusnumber_xpath)
        fansnumber_xpath = "//div[4]/div/a[2]/text()"
        fansnumber = tree.xpath(fansnumber_xpath)
        data.append([user_name, Inf, focusnumber, fansnumber])
        count += 1
        print("User {} recorded".format(count))
        time.sleep(random.randint(1, 2))
    except:
        print("Failed to record this user")
Finally, save the data.
file = r"D:\python爬虫\李子柒微博粉丝地区分布\粉丝公开信息.xlsx"
df = pd.DataFrame(data, columns=['user_name', 'Inf', 'focusnumber', 'fansnumber'])
df.to_excel(file, index=None)
print("Done")
The output is as follows:
The fan information we scraped is irregular and not directly usable for plotting, so it needs to be cleaned first (a sketch of one possible cleaning step is shown below); the cleaned result is as follows:
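The post shows the cleaned table only as a screenshot, so here is a minimal cleaning sketch. It assumes each scraped Inf cell reads like "女/四川 绵阳" (gender, a slash, then province and city) and that the cleaned file 粉丝信息.xlsx should hold one province-level region name per fan; the exact stripping will need adjusting to the real data:

import pandas as pd

df = pd.read_excel("粉丝公开信息.xlsx")

def extract_region(inf):
    # Hypothetical format: "女/四川 绵阳" -> "四川"
    text = str(inf).strip()                  # strip() also removes \xa0 padding
    location = text.split('/')[-1].strip()   # drop the gender part
    parts = location.split()
    return parts[0] if parts else None       # keep only the province token

df['Inf'] = df['Inf'].map(extract_region)
df.to_excel("粉丝信息.xlsx", index=None)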
Visualizing the fan data
With the fan data in hand, we use Python's pyecharts module to look at the regional distribution of Li Ziqi's Weibo fans.
## Import the required libraries and load the cleaned data
import pandas as pd
import numpy as np
from pyecharts.charts import Map
from pyecharts import options as opts
df = pd.read_excel("粉丝信息.xlsx")
df
Map (choropleth)
## Plot the fan distribution by region
address = pd.DataFrame(df['Inf'].value_counts())  # fan count per region
city = np.char.rstrip(list(address.index))        # region names, trailing whitespace stripped
Map1 = (
    Map(init_opts=opts.InitOpts(width="1200px", height="800px"))
    .add("",
         [list(z) for z in zip(city, address['Inf'])],
         "china",
         is_roam=False,
         is_map_symbol_show=False
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="李子柒微博粉丝地区分布"),
        visualmap_opts=opts.VisualMapOpts(max_=1500, is_piecewise=True,
            pieces=[
                {"max": 1500, "min": 1000, "label": ">1000", "color": "#2F7F50"},
                {"max": 999, "min": 600, "label": "600-999", "color": "#FFFFE0"},
                {"max": 599, "min": 200, "label": "200-599", "color": "#7FFFD4"},
                {"max": 199, "min": 1, "label": "1-199", "color": "#00FFFF"},
                {"max": 0, "min": 0, "label": "0", "color": "#EE82EE"},
            ])
    )
)
Map1.render("粉丝分布图.html")
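render() writes a standalone HTML file you can open in any browser; if you are working in a Jupyter notebook, Map1.render_notebook() displays the chart inline instead.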
Geo (coordinate scatter)
from pyecharts import options as opts
from pyecharts.charts import Geo
from pyecharts.globals import ChartType
g = (
    Geo(init_opts=opts.InitOpts(width="1200px", height="800px"))
    .add_schema(
        maptype="china",
        itemstyle_opts=opts.ItemStyleOpts(color="#5F9EA0", border_color="#2F4F4F"),
    )
    .add("",
         [list(z) for z in zip(city, address['Inf'])],
         label_opts=opts.LabelOpts(is_show=False),
         type_=ChartType.EFFECT_SCATTER
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="李子柒微博粉丝地区分布"),
        visualmap_opts=opts.VisualMapOpts(max_=1500, is_piecewise=True,
            pieces=[
                {"max": 1500, "min": 1000, "label": ">1000", "color": "#2F7F50"},
                {"max": 999, "min": 600, "label": "600-999", "color": "#FFFFE0"},
                {"max": 599, "min": 200, "label": "200-599", "color": "#FF4500"},
                {"max": 199, "min": 1, "label": "1-199", "color": "#6A5ACD"},
                {"max": 0, "min": 0, "label": "0", "color": "#FF0000"},  # the leading '#' was missing
            ])
    )
)
g.render("粉丝分布图3.html")
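One caveat: unlike Map, Geo places each point by looking the region name up in pyecharts' built-in coordinate table, so any name it does not recognize will fail to resolve. Such a name can be registered manually before calling .add(), e.g. g.add_coordinate("某地", 104.07, 30.67) (a hypothetical entry).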